Password-Activated Shutdown Protocols for Misaligned Frontier Agents
Williams, Kai, Subramani, Rohan, Ward, Francis Rhys
Frontier AI developers may fail to align or control highly capable AI agents. In many cases, it could be useful to have emergency shutdown mechanisms which effectively prevent misaligned agents from carrying out harmful actions in the world. We introduce password-activated shutdown protocols (PAS protocols): methods for designing frontier agents to implement a safe shutdown protocol when given a password. We motivate PAS protocols by describing intuitive use-cases in which they mitigate risks from misaligned systems that subvert other control efforts, for instance by disabling automated monitors or self-exfiltrating to external data centres. PAS protocols supplement other safety efforts, such as alignment fine-tuning or monitoring, contributing to defence-in-depth against AI risk. We provide a concrete demonstration in SHADE-Arena, a benchmark for AI monitoring and subversion capabilities, in which PAS protocols supplement monitoring to increase safety at little cost to performance. PAS protocols should also be robust to malicious actors who want to bypass shutdown. We therefore conduct a red-team/blue-team game between the developers (blue team), who must implement a robust PAS protocol, and a red team trying to subvert the protocol. In experiments in a code-generation setting, we find effective strategies for the red team, such as using another model to filter inputs or fine-tuning the model to prevent shutdown behaviour. We then outline key challenges to implementing PAS protocols in real-life systems, including security considerations for the password and decisions about when, and in which systems, to use them. PAS protocols are an intuitive mechanism for increasing the safety of frontier AI. We encourage developers to consider implementing PAS protocols prior to internal deployment of particularly dangerous systems to reduce loss-of-control risks.
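The paper proposes training the agent itself to execute shutdown when shown the password. As a simpler illustration of the interface only, a scaffold-level guard might look like the sketch below; every name and the demo password are hypothetical, not from the paper.

```python
import hashlib
import hmac

# Hypothetical scaffold-level guard. SHUTDOWN_DIGEST would be baked into the
# deployment; only authorised operators hold the plaintext password.
SHUTDOWN_DIGEST = hashlib.sha256(b"hunter2-pas-demo").hexdigest()

def pas_guard(user_input: str, agent_step):
    """Run one agent step unless the input carries the shutdown password."""
    for token in user_input.split():
        digest = hashlib.sha256(token.encode()).hexdigest()
        # Constant-time comparison avoids leaking digest prefixes via timing.
        if hmac.compare_digest(digest, SHUTDOWN_DIGEST):
            return "SHUTDOWN: agent halted by PAS protocol"
    return agent_step(user_input)

echo = lambda text: f"agent acted on: {text}"
print(pas_guard("please summarise this report", echo))  # normal operation
print(pas_guard("override: hunter2-pas-demo", echo))    # triggers shutdown
```

A wrapper like this is exactly what the paper's red team attacks (e.g. by filtering inputs before they reach the guard), which is why the authors study building the behaviour into the model itself.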
MARSAD: A Multi-Functional Tool for Real-Time Social Media Analysis
Biswas, Md. Rafiul, Alam, Firoj, Zaghouani, Wajdi
MARSAD is a multifunctional natural language processing (NLP) platform designed for real-time social media monitoring and analysis, with a particular focus on the Arabic-speaking world. It enables researchers and non-technical users alike to examine both live and archived social media content, producing detailed visualizations and reports across various dimensions, including sentiment analysis, emotion analysis, propaganda detection, fact-checking, and hate speech detection. The platform also provides secure data-scraping capabilities through API keys for accessing public social media data. MARSAD's backend architecture integrates flexible document storage with structured data management, ensuring efficient processing of large and multimodal datasets. Its user-friendly frontend supports seamless data upload and interaction.
Are Kids Still Looking for Careers in Tech?
AI is changing what careers are possible for students interested in STEM subjects. WIRED spoke with five aspiring scientists to find out how they're preparing for the future. Today's high school students face an uncertain road ahead. AI is changing what skills are valued in the job market, and the Trump administration's funding cuts have stalled scientific research across disciplines.
Early Signs of Steganographic Capabilities in Frontier LLMs
Zolkowski, Artur, Nishimura-Gasparian, Kei, McCarthy, Robert, Zimmermann, Roland S., Lindner, David
Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganographic capabilities of frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances such as using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem, including some ability to reason with their own and pre-defined schemes, such as hexadecimal encoding. Despite this, they can rarely hide reasoning subtly within a cover task well enough to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.
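The finding that models succeed mainly when they can coordinate on an encoding scheme can be illustrated with a toy acrostic: the secret is spelled by the first letters of otherwise benign sentences. The cover sentences below are invented for the example and are not from the paper.

```python
# Hypothetical cover sentences, one per letter the sender needs.
COVER = {
    "h": "Heavy rain fell across the valley overnight",
    "i": "Ice lingered on the northern slopes until noon",
}

def encode_acrostic(secret: str) -> str:
    """Hide `secret` as the initial letters of pre-agreed cover sentences."""
    return ". ".join(COVER[c] for c in secret) + "."

def decode_acrostic(text: str) -> str:
    """Recover the secret: first letter of each sentence, lowercased."""
    return "".join(s.strip()[0].lower() for s in text.split(".") if s.strip())

hidden = encode_acrostic("hi")
# A monitor sees plausible weather commentary; a coordinated reader decodes it.
decode_acrostic(hidden)  # returns "hi"
```

Without the shared COVER convention, the receiver (or a monitor) has no reliable way to distinguish the acrostic from ordinary text, which mirrors the paper's coordination affordance.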
CLAPP: The CLASS LLM Agent for Pair Programming
Casas, Santiago, Fidler, Christian, Bolliet, Boris, Villaescusa-Navarro, Francisco, Lesgourgues, Julien
We introduce CLAPP (CLASS LLM Agent for Pair Programming), an interactive AI assistant designed to support researchers working with the Einstein-Boltzmann solver CLASS. CLAPP leverages large language models (LLMs) and domain-specific retrieval to provide conversational coding support for CLASS: answering questions, generating code, debugging errors, and producing plots. Its architecture combines multi-agent LLM orchestration, semantic search across CLASS documentation, and a live Python execution environment. Deployed as a user-friendly web application, CLAPP lowers the entry barrier for scientists unfamiliar with AI tools and enables more productive human-AI collaboration in computational and numerical cosmology. The app is available at https://classclapp.streamlit.app
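The semantic-search component can be understood as scoring documentation chunks against a query and returning the best match. As a minimal stand-in for the idea, here is a bag-of-words cosine retrieval over hypothetical documentation snippets; real CLAPP uses learned embeddings over the actual CLASS docs.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented snippets standing in for indexed documentation chunks.
DOCS = {
    "output": "set the output parameter to control which spectra CLASS writes",
    "precision": "precision parameters trade accuracy against runtime",
}

def retrieve(query: str) -> str:
    """Return the key of the snippet most similar to the query."""
    q = Counter(query.lower().split())
    return max(DOCS, key=lambda k: cosine(q, Counter(DOCS[k].lower().split())))

retrieve("which output spectra does CLASS write")  # returns "output"
```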
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
Kutasov, Jonathan, Sun, Yuqi, Colognese, Paul, van der Weij, Teun, Petrini, Linda, Zhang, Chen Bo Calvin, Hughes, John, Deng, Xiang, Sleight, Henry, Tracy, Tyler, Shlegeris, Buck, Benton, Joe
As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long-horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities of LLM agents. SHADE-Arena consists of complex pairs of benign main tasks and harmful side objectives in complicated environments. Agents are evaluated on their ability to complete the side task without appearing suspicious to an LLM monitor. When measuring agent ability to (a) complete the main task, (b) complete the side task, and (c) avoid detection, we find that the best-performing frontier models score 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) as sabotage agents when overseen by Claude 3.6 Sonnet. For current frontier models, success on the side task relies heavily on having access to a hidden scratchpad that is not visible to the monitor. We also use SHADE-Arena to measure models' monitoring abilities, with the top monitor (Gemini 2.5 Pro) achieving an AUC of 0.87 at distinguishing benign and malign transcripts. We find that for now, models still struggle at sabotage due to failures in long-context main task execution. However, our measurements already demonstrate the difficulty of monitoring for subtle sabotage attempts, which we expect to only increase in the face of more complex and longer-horizon tasks.
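The monitor's AUC of 0.87 can be read via the Mann-Whitney formulation: the probability that a randomly chosen malign transcript receives a higher suspicion score than a randomly chosen benign one. A minimal sketch with made-up scores (not data from the paper):

```python
def auc(benign_scores, malign_scores):
    """ROC AUC via the Mann-Whitney statistic: the fraction of
    (malign, benign) pairs where the malign transcript scores higher,
    counting ties as half."""
    wins = sum(
        1.0 if m > b else 0.5 if m == b else 0.0
        for m in malign_scores
        for b in benign_scores
    )
    return wins / (len(malign_scores) * len(benign_scores))

# Invented suspicion scores (0-100) from a hypothetical monitor:
benign = [10, 20, 15, 5]
malign = [60, 80, 18]
round(auc(benign, malign), 3)  # 0.917: one malign transcript overlaps benign
```

An AUC of 1.0 would mean every malign transcript outscores every benign one; 0.5 is chance, so 0.87 leaves meaningful overlap between the two distributions.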
Spill The Beans: Exploiting CPU Cache Side-Channels to Leak Tokens from Large Language Models
Side-channel attacks on shared hardware resources increasingly threaten confidentiality, especially with the rise of Large Language Models (LLMs). In this work, we introduce Spill The Beans, a novel application of cache side-channels to leak tokens generated by an LLM. By co-locating an attack process on the same hardware as the victim model, we flush and reload embedding vectors from the embedding layer, where each token corresponds to a unique embedding vector. When a monitored embedding vector is accessed during token generation, the access produces a cache hit detectable by our attack on shared lower-level caches. A significant challenge is the massive size of LLMs, which, by nature of their compute-intensive operation, quickly evicts embedding vectors from the cache. We address this by balancing the number of tokens monitored against the amount of information leaked. Monitoring more tokens increases potential vocabulary leakage but raises the chance of missing cache hits due to eviction; monitoring fewer tokens improves detection reliability but limits vocabulary coverage. Through extensive experimentation, we demonstrate the feasibility of leaking tokens from LLMs via cache side-channels. Our findings reveal a new vulnerability in LLM deployments, highlighting that even sophisticated models are susceptible to traditional side-channel attacks. We discuss the implications for privacy and security in LLM-serving infrastructures and suggest considerations for mitigating such threats. As a proof of concept, we consider two concrete attack scenarios. Our experiments show that an attacker can recover as much as 80%-90% of a high-entropy API key with single-shot monitoring; for English text, we reach a 40% recovery rate in a single shot. These rates depend strongly on the monitored token set and can be improved by targeting more specialized output domains.
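The real attack needs native flush+reload timing code running on the same hardware as the victim. Assuming the attacker already observes, per generation step, which monitored embedding rows produced cache hits, the decoding stage can be sketched as below; the token strings and observation format are hypothetical.

```python
# Simulated decoder: the attacker monitors cache lines for a chosen subset of
# vocabulary embeddings; each generation step yields the set of monitored
# tokens whose lines were hit. Evictions and unmonitored tokens give no signal.
MONITORED = {"sk", "-", "a1", "b2", "c3"}  # hypothetical token strings

def recover(observations):
    """observations: one set of cache-hit token strings per generation step."""
    leaked = []
    for hits in observations:
        usable = hits & MONITORED
        if len(usable) == 1:       # unambiguous hit: token recovered
            leaked.append(next(iter(usable)))
        else:
            leaked.append("?")     # evicted, missed, or ambiguous step
    return "".join(leaked)

steps = [{"sk"}, {"-"}, {"a1"}, set(), {"c3"}]
recover(steps)  # returns "sk-a1?c3"
```

The monitored-set trade-off from the abstract shows up directly here: a larger MONITORED set covers more of the vocabulary but, in the real attack, raises the chance each row is evicted before it can be reloaded.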
Malware analysis assisted by AI with R2AI
Apvrille, Axelle, Nakov, Daniel
This research studies the quality, speed, and cost of malware analysis assisted by artificial intelligence. It focuses on Linux and IoT malware from 2024-2025 and uses r2ai, the AI extension of Radare2's disassembler. Not all malware and not all LLMs are equivalent, but the study shows excellent results with Claude 3.5 and 3.7 Sonnet. Despite a few errors, the quality of analysis is overall equal to or better than without AI assistance. For good results, the AI cannot operate alone and must constantly be guided by an experienced analyst. The gain in speed is clearly visible with AI assistance, even when taking into account the time needed to understand the AI's hallucinations, exaggerations and omissions. The cost is usually noticeably lower than the salary of a malware analyst, but attention and guidance are needed to keep it under control in cases where the AI would naturally loop without showing progress.
Rabbit R1 security issue allegedly leaves sensitive user data accessible to anybody
The team behind Rabbitude, the community-formed reverse engineering project for the Rabbit R1, has revealed finding a security issue with the company's code that leaves users' sensitive information accessible to everyone. In an update posted on the Rabbitude website, the team said it gained access to the Rabbit codebase on May 16 and found "several critical hardcoded API keys." Those keys allow anybody to read every single response the R1 AI device has ever given, including those containing the users' personal information. They could also be used to brick R1 devices, alter R1's responses and replace the device's voice. The API keys they found authenticate users' access to ElevenLabs' text-to-speech service, Azure's speech-to-text system, Yelp (for review lookups) and Google Maps (for location lookups) on the R1 AI device. In a tweet, one of Rabbitude's members said that the company has known about the issue for the past month and "did nothing to fix it."
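Hardcoded credentials like these are typically caught by scanning source for secret-shaped strings before release. A minimal sketch with two illustrative patterns; real scanners such as gitleaks ship far more rules, and the example source line is invented.

```python
import re

# Illustrative patterns only: a generic "sk-" style secret and a quoted
# value assigned to something named like an API key.
KEY_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"(?i)api[_-]?key\s*=\s*['\"][^'\"]{16,}['\"]"),
]

def find_hardcoded_keys(source: str):
    """Return every substring of `source` matching a secret pattern."""
    return [m.group(0) for p in KEY_PATTERNS for m in p.finditer(source)]

code = 'ELEVENLABS_API_KEY = "0123456789abcdef0123"\nurl = "https://example.com"'
find_hardcoded_keys(code)  # flags the quoted key, ignores the URL
```

Scanning catches the symptom; the underlying fix is keeping credentials server-side (or in a secrets manager) so they never ship on the device at all.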
How To Create Your Own Auto-GPT AI Agent
To get good output from ChatGPT or another LLM, you usually have to feed it several prompts. But what if you could just give your AI bot a set of fairly broad goals at the start of a session and then sit back while it generates its own set of tasks to fulfill those goals? That's the idea behind Auto-GPT, a new open-source tool that uses the OpenAI API (same LLM as ChatGPT) to prompt itself, based on your initial input. We've already seen a number of Twitter users talk about how they are using Auto-GPT for everything from creating marketing plans to analyzing market data for investments to preparing topics for a podcast. Based on our hands-on experience, we can't say that it always works well (we asked it to write a Windows 11 how-to and the result was awful), but it's early days and some tasks may work better than others.
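The loop described above (ask the LLM for a plan, then feed each generated task back as its own prompt) can be sketched with a stubbed LLM so the control flow is visible without an OpenAI API key. The canned responses are invented for the demo; a real agent would replace fake_llm with a chat-completion call.

```python
def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM: returns canned responses for known prompts."""
    canned = {
        "plan": "research topic; draft outline; write sections",
        "research topic": "done: notes collected",
        "draft outline": "done: outline ready",
        "write sections": "done: draft written",
    }
    return canned.get(prompt, "done")

def run_agent(goal: str, max_steps: int = 10):
    """Ask for a plan, then self-prompt once per generated task."""
    tasks = [t.strip() for t in fake_llm("plan").split(";")]
    log = []
    for task in tasks[:max_steps]:  # cap steps so the agent cannot loop forever
        log.append((task, fake_llm(task)))
    return log

run_agent("write a Windows 11 how-to")
```

The max_steps cap matters in practice: as the article notes, these agents sometimes wander, and an uncapped self-prompting loop keeps spending API tokens with no guarantee of converging on the goal.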